one.dice <- function(){
dice <- sample(1:6, size = 1, replace = TRUE)
return(dice)
}
one.dice() # It is simulating a dice roll.[1] 4
Discuss these following examples with your class mates and explain each case and comment on the results.
A multinomial distribution is a distribution that shows the likelihood of the possible results of a experiment with repeated trials in which each trial can result in a specified number of outcomes that is greater than two. A multinomial distribution could show the results of tossing a dice, because a dice can land on one of six possible values. By contrast, the results of a coin toss would be shown using a binomial distribution because there are only two possible results of each toss, heads or tails.
Two additional key characteristics of a multinomial distribution are that the trials it illustrates must be independent (e.g., in the dice experiment, rolling a five does not have any impact on the number that will be rolled next) and the probability of each possible result must be constant (e.g., on each roll, there is a one in six chance of any number on the die coming up).
We are rolling a die 100 times.
one.dice <- function(){
dice <- sample(1:6, size = 1, replace = TRUE)
return(dice)
}
one.dice() # It is simulating a dice roll.[1] 4
# There are 4 simulations on 100 dice rolls
par(mfrow=c(2,2))
for (i in 1:4){
sims <- replicate(100, one.dice())
table(sims)
table(sims)/length(sims)
plot(table(sims), xlab = 'Event', ylab = 'Frequency')
}We are rolling a die 10000 times
# This time there are 4 simulations on 10000 dice rolls. Compared to b, the distribution of each number on dice is similar.
par(mfrow=c(2,2))
for (i in 1:4){
sims <- replicate(10000, one.dice())
table(sims)
table(sims)/length(sims)
plot(table(sims), xlab = 'Event', ylab = 'Frequency')
}From the class example
Let’s say that Molly, Ryan and Mr.Bob are buying beer(x1) ,bread(x2) and coke(x3) with probabilities (3/5,1/5,1/5).
3*(3/5)*(1/5)^2[1] 0.072
dmultinom(c(1,2,0),prob=c(3/5,1/5,1/5))[1] 0.072
So these two probability are same.
df = data.frame(t(data.frame(rmultinom(10000,3,prob=c(3/5,1/5,1/5)))))
colnames(df) = c("Beer","Bread","Coke")
hist(df$Beer, main = "Marginal distribution of x1")Discuss this with your class mates and comment on the Plots. What can you observe from each plot?
Helpful links to answer this question:
-> Contour plot also gives the densities.
https://blog.revolutionanalytics.com/2016/02/multivariate_data_with_r.html
-> Then we have these ellipses; the circular symmetric version of complex normal distribution. https://en.wikipedia.org/wiki/Elliptical_distribution
ellipse: https://en.wikipedia.org/wiki/Ellipse
-> http://cs229.stanford.edu/section/gaussians.pdf
This tells you how when correlation coefficient increases the distribution spread and how the ellipses look like. -> https://online.stat.psu.edu/stat505/book/export/html/636
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mvtnorm)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:plotly':
select
The following object is masked from 'package:dplyr':
select
library(ggplot2)Source: https://data-se.netlify.app/2018/12/13/visualizing-a-multivariate-normal-distribution/
First, let’s define a covariance matrix Σ:
sigma <- matrix(c(4,2,2,3), ncol = 2)
sigma [,1] [,2]
[1,] 4 2
[2,] 2 3
Then, simulate observations n = n from these covariance matrix; the means need be defined, too. As the rank of our covariance matrix is 2, we need two means:
means <- c(0, 0)
n <- 1000
set.seed(42)
x <- rmvnorm(n = n, mean = means, sigma = sigma)
str(x) num [1:1000, 1:2] 2.314 1.053 0.716 2.848 3.839 ...
head(x) [,1] [,2]
[1,] 2.3139150 -0.15442375
[2,] 1.0527522 1.24094662
[3,] 0.7162789 0.05340542
[4,] 2.8479495 0.69465309
[5,] 3.8388378 1.03195246
[6,] 3.7900042 4.47972608
You can see that X is bivariately normal distributed.
Let’s make a data frame out of it:
d <- data.frame(x)
names(d)[1] "X1" "X2"
## marginal of X1
d %>%
ggplot(aes(x = X1)) +
geom_density()It is plotting the density of X1 and it is very close to normal distribution.
p1 <- data_frame(x = -3:3) %>%
ggplot(aes(x = x)) +
stat_function(fun = dnorm, n = n)Warning: `data_frame()` was deprecated in tibble 1.1.0.
ℹ Please use `tibble()` instead.
p1This theoretical standard normal distribution with standard deviation at 1 and mean at 0. 3.a. plot is very close to this.
p2 <- ggplot(d, aes(x = X1, y = X2)) +
geom_point(alpha = .5) +
geom_density_2d()
p2It is plotting a joint distribution of x1 and x2 with contour lines. Points are clustering around the mean and distributed like normal distribution.
Geom binhex https://ggplot2.tidyverse.org/reference/geom_hex.html
p3 <- ggplot(d, aes(x = X1, y = X2)) +
geom_point(alpha = .5) +
geom_bin2d() +
scale_fill_viridis_c()
p3It is similar to 3.c, yet it shows the density of points by color.
(p <- plot_ly(d, x = ~X1, y = ~X2))No trace type specified:
Based on info supplied, a 'scatter' trace seems appropriate.
Read more about this trace type -> https://plotly.com/r/reference/#scatter
No scatter mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
add_histogram2d(p)First plot is a scatter plot of x1 and x2. Second one is a heat map showing the density of 1x1 area.
add_histogram2dcontour(p)It is a heat map with contour lines. Having contour lines, it has smoother boundaries.
dens <- kde2d(d$X1, d$X2)
plot_ly(x = dens$x,
y = dens$y,
z = dens$z) %>% add_surface()It is a plot making 3.g into 3d. When there are more points, it has a high number on z axis.
First, compute the density of each (X1, X2) pair.
d$dens <- dmvnorm(x = d)Now plot a point for each (X1, X2, dens) tuple.
p4 <- plot_ly(d, x = ~ X1, y = ~ X2, z = ~ dens,
marker = list(color = ~ dens,
showscale = TRUE)) %>%
add_markers()
p4It is plotting individual points in 3d with density. If there are more points(more density), the color changes to red.
Try Topic Modeling on the HPCorpus. Where I have included Harry Potter Texts and the Lord of the Ring texts.
This article explains Topic Modeling in R very clearly (please follow it) https://www.tidytextmining.com/topicmodeling.html
Also,please follow this article on “stopping words” https://smltar.com/stopwords.html